In previous notebooks, we have assigned each cell to its condition (0h, 8h, etc.), we have filtered low-quality cells, and we have normalized gene counts to correct for biases such as differences in library size. The result of that is a SingleCellExperiment object that we saved as .RDS and that will be our starting point in this analysis.
Here, we aim to cluster cells to identify each cell type. Hence, we are going to use Seurat, a CRAN package that has become a swiss-knife for scRNA-seq analysis. As described in Kiselev et al, Seurat uses a graph-based clustering algorithm that is scalable to datasets with thousands of cells. Therefore, we will leverage such scalability to cluster >11,000 cells contained in the SCE object.
library(stringr)
library(psych)
library(pheatmap)
library(fitdistrplus)
library(ggpubr)
library(SingleCellExperiment)
library(scater)
library(Seurat)
library(purrr)
library(grid)
library(gridExtra)
library(gridGraphics)
library(cowplot)
library(matchSCore2)
library(nnet)
library(Matrix)
library(tidyverse)
Seurat uses its own single-cell data container, a so-called Seurat object. Hence, we first need to convert the SCE to this new data structure:
date <- Sys.Date()
# Load SingleCellExperiment
sce <- readRDS("results/R_objects/10X_SingleCellExperiment_filt&norm.RDS")
# To increase interpretability downstream, change rownames from ensembl to gene
# symbol
rowData(sce)$name %>%
table() %>%
sort(decreasing = TRUE) %>%
head(10)
## .
## PNRC2 SRSF10 7SK A1BG A2M-AS1 AAAS AACS AAED1 AAGAB AAK1
## 2 2 1 1 1 1 1 1 1 1
ind <- match(c("PNRC2", "SRSF10"), rowData(sce)$name)
rowData(sce)$name[ind] <- c("PNRC2.1", "SRSF10.1")
rownames(sce) <- rowData(sce)$name
# Convert SCE to Seurat
seurat <- Convert(from = sce, to = "seurat")
To cluster our cells, we need to overcome 2 challenges:
A first approach to tackle these challenges consists of finding the most variable genes as a means of feature selection. That is, to find the subset of genes that drive most of the variability in the expression matrix. Seurat calculates the average expression and dispersion for each gene. Then, it divides genes into bins based on its average, and for each bin computes a z-score per gene. Those genes with a z-score above a certain cutoff are categorized as highly variable. The binning step is vital, since genes with more UMI tend to have more dispersion.
seurat <- FindVariableGenes(seurat, display.progress = FALSE)
length(seurat@var.genes)
## [1] 319
As we can see, we reduce the number of dimensions from >10,000 genes to 319 HVG.
An important pre-processing step in any cluster analysis is to scale the data, as otherwise variables with a higher mean will have a higher weight in the distance metric. Remember that we observed batch effects, so we regress out the âbatchâ variable:
seurat <- ScaleData(seurat, vars.to.regress = "batch")
## NormalizeData has not been run, therefore ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
##
## Time Elapsed: 23.7365520000458 secs
An additional challenge in our cluster analysis is that scRNAs-seq is very noisy (very susceptible to technical artifacts), and very sparse (contains drop-outs). Thus, differences in single genes may not be accurate to identify cell types. To that end, we can perform Principal Component Analysis, as PC can be conceived as a âmetageneâ that includes information across a correlated gene set. Furthermore, we will reduce the dimensionality even more.
seurat <- RunPCA(
object = seurat,
pc.genes = seurat@var.genes,
do.print = TRUE,
pcs.print = 1:5,
genes.print = 5
)
## [1] "PC1"
## [1] "LYZ" "FCN1" "S100A9" "CTSS" "LST1"
## [1] ""
## [1] "KLRB1" "NKG7" "CCL5" "CTSW" "GNLY"
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "RPL34" "CD79A" "MS4A1" "RGCC" "RGS10"
## [1] ""
## [1] "NKG7" "GNLY" "CST7" "PRF1" "FGFBP2"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "FGL2" "CD74" "MNDA" "NCF1" "HLA-DQA1"
## [1] ""
## [1] "C15orf48" "G0S2" "PLIN2" "THBS1" "IL8"
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "S100A8" "S100A9" "S100A12" "VCAN" "MNDA"
## [1] ""
## [1] "CD79A" "MS4A1" "CD74" "HLA-DPB1" "CD79B"
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "H3F3B" "SYTL3" "ZNF331" "TSPYL2" "PMAIP1"
## [1] ""
## [1] "TIMP1" "FCGR3A" "ACTB" "C15orf48" "G0S2"
## [1] ""
## [1] ""
VizPCA(object = seurat, pcs.use = 1:2)
PCHeatmap(
object = seurat,
pc.use = 1,
cells.use = 500,
do.balanced = TRUE,
label.columns = FALSE
)
To determine the number of significant PCs to use, we will take advantage of a scree plot. That is, we will plot the variance explained by each PC in an ordered manner. We identify the number of significant PCs as the one in which we can observe an âelbowâ (i.e. the reduction in explained variance diminishes):
PCElbowPlot(seurat)
The elbow is in PC4, so that is what we are going to use to cluster cells using the FindClusters function.
Seurat uses the Louvain algorithm to cluster cells:
seurat <- FindClusters(
object = seurat,
reduction.type = "pca",
dims.use = 1:4,
resolution = 0.2,
print.output = 0
)
We can visualize the former clusters with a t-Stochastic Neighbor Embedding (tSNE), which allows to depict more structure in the data than PCA:
set.seed(1)
seurat <- RunTSNE(seurat, dims.use = 1:4)
TSNEPlot(seurat)
As we can see, there are 4 major clusters. Interestingly:
seurat@meta.data$cluster <- seurat@ident
seurat <- SetAllIdent(seurat, "batch")
TSNEPlot(seurat)
Regressing out the batch effect with the ScaleData function from above removed the majority of the batch effect.
Let us find the markers of each of the clusters above. In other words, let us find which genes are exclusively expressed in each cluster and will help us identify the cell types in our data set:
seurat <- SetAllIdent(seurat, "cluster")
cluster_markers <- map(0:4, ~ FindMarkers(seurat, ident.1 = ., mon.pct = 0.25))
map(cluster_markers, head, 10)
## [[1]]
## p_val avg_logFC pct.1 pct.2 p_val_adj
## LTB 0 0.9528446 0.802 0.283 0
## IL7R 0 0.7390348 0.632 0.160 0
## LDHB 0 0.7090367 0.764 0.345 0
## RPS6 0 0.6259373 1.000 0.995 0
## RPS12 0 0.6220858 1.000 0.995 0
## RPL32 0 0.5620295 1.000 0.997 0
## RPS25 0 0.5594435 0.999 0.981 0
## RPL13 0 0.5284821 1.000 0.998 0
## RPL36 0 0.5267855 0.999 0.972 0
## RPL34 0 0.5075244 1.000 0.998 0
##
## [[2]]
## p_val avg_logFC pct.1 pct.2 p_val_adj
## GZMK 0.000000e+00 1.2261217 0.438 0.049 0.000000e+00
## CCL5 0.000000e+00 0.7589963 0.838 0.290 0.000000e+00
## DUSP2 1.803133e-237 0.6926996 0.497 0.184 1.812148e-233
## LYAR 7.330262e-236 0.6055701 0.495 0.180 7.366914e-232
## CD3D 8.937610e-179 0.6287266 0.693 0.424 8.982298e-175
## FCER1G 1.106223e-163 -1.4947776 0.078 0.357 1.111754e-159
## CD8A 1.496912e-153 0.4389741 0.219 0.050 1.504397e-149
## CXCR4 4.408869e-132 0.4993092 0.818 0.655 4.430913e-128
## HLA-B 1.293942e-125 0.2668381 0.992 0.972 1.300412e-121
## KLRB1 1.129545e-121 0.7471007 0.618 0.399 1.135192e-117
##
## [[3]]
## p_val avg_logFC pct.1 pct.2 p_val_adj
## GNLY 0 2.743886 0.969 0.258 0
## NKG7 0 2.715207 0.997 0.301 0
## GZMB 0 2.566169 0.866 0.086 0
## PRF1 0 2.195510 0.741 0.092 0
## FGFBP2 0 2.154904 0.763 0.061 0
## GZMA 0 2.002425 0.824 0.192 0
## CST7 0 1.946414 0.876 0.181 0
## CCL4 0 1.822190 0.602 0.068 0
## KLRF1 0 1.817804 0.625 0.043 0
## CTSW 0 1.810281 0.838 0.236 0
##
## [[4]]
## p_val avg_logFC pct.1 pct.2 p_val_adj
## S100A9 0 4.723944 0.882 0.066 0
## LYZ 0 4.694222 0.984 0.061 0
## S100A8 0 4.533457 0.778 0.046 0
## FTL 0 3.101847 1.000 0.927 0
## CST3 0 2.953535 0.973 0.046 0
## SAT1 0 2.742091 0.969 0.333 0
## AIF1 0 2.642039 0.888 0.086 0
## LST1 0 2.576841 0.944 0.058 0
## CTSS 0 2.553252 0.955 0.189 0
## FTH1 0 2.435048 1.000 0.953 0
##
## [[5]]
## p_val avg_logFC pct.1 pct.2 p_val_adj
## IGLL5 0 4.498362 0.399 0.013 0
## IGJ 0 2.712879 0.407 0.015 0
## CD74 0 2.640941 1.000 0.433 0
## CD79A 0 2.070005 0.814 0.009 0
## HLA-DRA 0 1.789626 0.985 0.197 0
## MS4A1 0 1.659741 0.686 0.005 0
## HLA-DQB1 0 1.538014 0.875 0.139 0
## HLA-DPB1 0 1.535158 0.958 0.271 0
## HLA-DRB1 0 1.449467 0.967 0.205 0
## TCL1A 0 1.384979 0.349 0.001 0
Based on the previously found markers, we can annotate each cluster to known cell types:
| Cluster ID | Markers | Cell Type |
|---|---|---|
| 0 | IL7R | CD4 T cells |
| 1 | CD8A | CD8 T cells |
| 2 | GNLY, NKG7 | Natural Killer (NK) |
| 3 | LYZ | Monocytes |
| 4 | MS4A1 | B cells |
new_cluster_ids <- c("CD4 T", "CD8 T", "NK", "Monocyte", "B")
levels(seurat@ident) <- new_cluster_ids
markers_gg_l <- FeaturePlot(
object = seurat,
features.plot = c("IL7R", "CD8A", "GNLY", "LYZ", "MS4A1"),
cols.use = c("grey", "blue"),
do.return = TRUE
)
markers_gg_l <- map(markers_gg_l, function(gg) {
gg +
theme(plot.title = element_text(face = "plain", size = 12),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title = element_blank())
})
markers_gg <- ggarrange(
plotlist = markers_gg_l,
ncol = 2,
nrow = 3,
labels = "auto",
font.label = list(size = 14)
)
ggsave(
filename = str_c("results/plots/", date, "_cell_type_markers.pdf"),
plot = markers_gg,
device = "pdf",
width = 10,
height = 10
)
markers_gg
tsne_cell_types <- TSNEPlot(seurat)
tsne_cell_types <- tsne_cell_types +
scale_color_manual(values = c("#c20a35", "#aa2edc", "#71bdd0", "#bbaa2a", "chartreuse3"),
labels = c("CD4 T", "CD8 T", "NK", "Monocytes", "B")) +
theme(axis.line = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title = element_blank(),
panel.border = element_blank(),
legend.text = element_text(size = 13))
ggsave(
filename = str_c("results/plots/", date, "_cell_type_tsne.pdf"),
plot = tsne_cell_types,
device = "pdf",
width = 11,
height = 10
)
tsne_cell_types
# Clustering figure
legend <- as_ggplot(get_legend(tsne_cell_types))
tsne_cell_types <- tsne_cell_types +
theme(legend.position = "none")
clustering_figure <- plot_grid(
markers_gg,
NULL,
tsne_cell_types,
ncol = 3,
nrow = 1,
rel_widths = c(0.9, 0.05, 1.1),
labels = c("", "", "f"),
label_size = 14
)
ggsave(
filename = str_c("doc/figures/R/", date, "_clustering_figure.pdf"),
plot = clustering_figure,
device = "pdf",
width = 19,
height = 11,
units = "cm"
)
ggsave(
filename = str_c("doc/figures/legends/", date, "legend_clustering_figure.pdf"),
plot = legend,
device = "pdf",
width = 2,
height = 2
)
# Convert to SCE
sce2 <- Convert(from = seurat, to = "sce")
# Recode colData variables
colData(sce2) <- colData(sce2)[, c("batch", "donor", "ident", "condition")]
conds <- c("0h", "2h", "8h", "24h_RT", "48h_RT", "24h_4C", "48h_4C")
sce2$condition <- factor(sce2$condition, levels = conds)
levels(sce2$condition) <- conds %>%
str_remove("RT") %>%
str_remove("_")
sce2$temperature <- case_when(
sce2$condition == "0h" ~ "gold",
sce2$condition %in% c("2h", "8h", "24h", "24hBioabank", "48h") ~ "room temperature",
sce2$condition %in% c("24h4C", "48h4C") ~ "4ÂșC"
)
sce2$condition <- str_remove(as.character(sce2$condition), "4C")
colnames(colData(sce2)) <- c("batch", "sex", "cell_type", "time", "temperature")
# Recode rowData variables
rowData(sce2) <- rowData(sce)
# Save as RDS
saveRDS(sce2, "results/R_objects/10X_SingleCellExperiment_clustered.RDS")
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.4.0 dplyr_0.8.0.1 readr_1.3.1 tidyr_0.8.3 tibble_2.1.1 tidyverse_1.2.1 nnet_7.3-12 matchSCore2_0.0.0.9000 gridGraphics_0.3-0 gridExtra_2.3 purrr_0.3.2 Seurat_2.3.4 Matrix_1.2-17 cowplot_0.9.4 scater_1.10.1 SingleCellExperiment_1.4.1 SummarizedExperiment_1.12.0 DelayedArray_0.8.0 BiocParallel_1.16.6 matrixStats_0.54.0 Biobase_2.42.0 GenomicRanges_1.34.0 GenomeInfoDb_1.18.2 IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0 ggpubr_0.2 magrittr_1.5 ggplot2_3.1.0 fitdistrplus_1.0-14 npsurv_0.4-0 lsei_1.2-0 survival_2.44-1.1 MASS_7.3-51.3 pheatmap_1.0.12 psych_1.8.12 stringr_1.4.0 BiocStyle_2.10.0
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 snow_0.4-3 backports_1.1.3 Hmisc_4.2-0 plyr_1.8.4 igraph_1.2.4 lazyeval_0.2.2 splines_3.5.1 digest_0.6.18 foreach_1.4.4 htmltools_0.3.6 viridis_0.5.1 lars_1.2 gdata_2.18.0 checkmate_1.9.1 cluster_2.0.7-1 mixtools_1.1.0 ROCR_1.0-7 modelr_0.1.4 R.utils_2.8.0 colorspace_1.4-1 rvest_0.3.2 haven_2.1.0 xfun_0.5 jsonlite_1.6 crayon_1.3.4 RCurl_1.95-4.12 zoo_1.8-5 iterators_1.0.10 ape_5.3 glue_1.3.1 gtable_0.3.0 zlibbioc_1.28.0 XVector_0.22.0 kernlab_0.9-27 Rhdf5lib_1.4.3 prabclus_2.2-7 DEoptimR_1.0-8 HDF5Array_1.10.1 scales_1.0.0 mvtnorm_1.0-10 bibtex_0.4.2 Rcpp_1.0.1 metap_1.1 dtw_1.20-1 viridisLite_0.3.0 htmlTable_1.13.1
## [48] reticulate_1.11.1 foreign_0.8-71 bit_1.1-14 proxy_0.4-23 mclust_5.4.3 SDMTools_1.1-221 Formula_1.2-3 tsne_0.1-3 htmlwidgets_1.3 httr_1.4.0 gplots_3.0.1.1 RColorBrewer_1.1-2 fpc_2.1-11.1 acepack_1.4.1 modeltools_0.2-22 ica_1.0-2 pkgconfig_2.0.2 R.methodsS3_1.7.1 flexmix_2.3-15 labeling_0.3 tidyselect_0.2.5 rlang_0.3.3 reshape2_1.4.3 cellranger_1.1.0 munsell_0.5.0 tools_3.5.1 cli_1.1.0 generics_0.0.2 broom_0.5.1 ggridges_0.5.1 evaluate_0.13 yaml_2.2.0 knitr_1.22 bit64_0.9-7 robustbase_0.93-4 caTools_1.17.1.2 RANN_2.6.1 pbapply_1.4-0 nlme_3.1-137 R.oo_1.22.0 xml2_1.2.0 hdf5r_1.0.1 compiler_3.5.1 rstudioapi_0.10 png_0.1-7 beeswarm_0.2.3 stringi_1.4.3
## [95] lattice_0.20-38 trimcluster_0.1-2.1 pillar_1.3.1 BiocManager_1.30.4 Rdpack_0.10-1 lmtest_0.9-36 data.table_1.12.0 bitops_1.0-6 irlba_2.3.3 gbRd_0.4-11 R6_2.4.0 latticeExtra_0.6-28 bookdown_0.9 KernSmooth_2.23-15 vipor_0.4.5 codetools_0.2-16 gtools_3.8.1 assertthat_0.2.1 rhdf5_2.26.2 withr_2.1.2 mnormt_1.5-5 GenomeInfoDbData_1.2.0 hms_0.4.2 diptest_0.75-7 doSNOW_1.0.16 rpart_4.1-13 class_7.3-15 rmarkdown_1.12 DelayedMatrixStats_1.4.0 segmented_0.5-3.0 Rtsne_0.15 lubridate_1.7.4 base64enc_0.1-3 ggbeeswarm_0.6.0